Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search indexing with storage #5854

Merged
merged 11 commits into from
Aug 8, 2019

Conversation

davidfischer
Copy link
Contributor

@davidfischer davidfischer commented Jun 27, 2019

Changes

  • Indexes HTMLFiles and ImportedFiles from storage rather than from local disk
  • Reads intersphinx data from storage (over HTTP) rather than local disk
  • This change (as currently written) would begin to require settings.RTD_BUILD_MEDIA_STORAGE. This has implications for RTD corporate as well as development.

Note: "storage" could be a filesystem backed storage rather than a remote cloud storage but basically this uses the Django storage abstraction.

@davidfischer davidfischer added the PR: work in progress Pull request is not ready for full review label Jun 27, 2019
Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I just did a quick look and looks good

@davidfischer
Copy link
Contributor Author

There are going to be some performance implications of this when storage is cloud storage as opposed to local storage. If, however, we stop doing any processing on non-HTML files, the performance difference probably won't matter.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good approach, and would benefit from a few cleanup/speedup ideas I've been meaning to implement. I need to test this locally a bit to understand it fully. Really excited to get this working.

type_='json', version_slug=self.version.slug, include_file=False
)
try:
for fjson_path in fjson_paths:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should be able to get away from this soon. We should be starting to store the proper path of a file after readthedocs/readthedocs-sphinx-ext#62 is merged.

storage_path = version.project.get_storage_path(
type_='html', version_slug=version.slug, include_file=False
)
for root, __, filenames in storage.walk(storage_path):
for filename in filenames:
if filename.endswith('.html'):
model_class = HTMLFile
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to add the elif project.cdn_enabled here with an else: continue. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does project.cdn_enabled do exactly and what would it do differently for search?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means we store all ImportedFile's not just HTMLFile's, so we can purge them from the CDN properly when they change.

@codecov
Copy link

codecov bot commented Jul 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@7be01d8). Click here to learn what that means.
The diff coverage is 34.78%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5854   +/-   ##
=========================================
  Coverage          ?   79.26%           
=========================================
  Files             ?      175           
  Lines             ?    10826           
  Branches          ?     1350           
=========================================
  Hits              ?     8581           
  Misses            ?     1891           
  Partials          ?      354
Impacted Files Coverage Δ
readthedocs/projects/models.py 82.49% <0%> (ø)
readthedocs/projects/tasks.py 64.96% <25%> (ø)
readthedocs/search/parse_json.py 71.42% <60%> (ø)
readthedocs/builds/storage.py 80.35% <78.57%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be01d8...bdc84bd. Read the comment docs.

@davidfischer davidfischer removed the PR: work in progress Pull request is not ready for full review label Jul 16, 2019
@davidfischer
Copy link
Contributor Author

This should be good for a full review. The test failure seems to be in master and is unrelated.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good 🎉. In general, this should just work locally and in prod, right? Locally & .com it will just use the filesystem, but in prod we'll be hitting backend storage, which might cause some performance issues?

This is going to conflict with a bit of the logic we added in the search updates around GSOC, so I'm going to wait to merge this until after that is merged. It shouldn't be too much, we just had to change a bit of the logic around the ordering of creating HTMLFile's & SphinxDomain's and indexing them.

:param commit: Commit that updated path
:param build: Build id
"""

if not settings.RTD_BUILD_MEDIA_STORAGE:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should likely log something here, so we don't get confused if indexing isn't working.

@@ -234,7 +234,7 @@ def USE_PROMOS(self): # noqa

# Optional Django Storage subclass used to write build artifacts to cloud or local storage
# https://docs.readthedocs.io/en/stable/settings.html#build-media-storage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link needs to be updated, and the default changed.

@codecov
Copy link

codecov bot commented Jul 18, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@7be01d8). Click here to learn what that means.
The diff coverage is 34.78%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5854   +/-   ##
=========================================
  Coverage          ?   79.26%           
=========================================
  Files             ?      175           
  Lines             ?    10826           
  Branches          ?     1350           
=========================================
  Hits              ?     8581           
  Misses            ?     1891           
  Partials          ?      354
Impacted Files Coverage Δ
readthedocs/projects/models.py 82.49% <0%> (ø)
readthedocs/projects/tasks.py 64.96% <25%> (ø)
readthedocs/search/parse_json.py 71.42% <60%> (ø)
readthedocs/builds/storage.py 80.35% <78.57%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be01d8...bdc84bd. Read the comment docs.

@davidfischer
Copy link
Contributor Author

I think this is good for a full review. For .org and for development, this should just work without issue. For .com, I think settings.RTD_BUILD_MEDIA_STORAGE needs to be unset on the build machines (they shouldn't copy files anywhere, the syncers will do that) and should be the default BuildMediaFileSystemStorage on the web machines where indexing happens.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited to get this shipped and delete JSON from our disks. 🎉

Should be good to merge w/ conflicts fixed.

@davidfischer
Copy link
Contributor Author

It looks like some of the test directories changed in the last week and that's causing the test failures. I'll get them fixed up.

- Fixed a bug involving the path on imported files
- Fixed tests to account for extra files in test dir
@davidfischer
Copy link
Contributor Author

Ok, I believe the tests should pass. The added tests did actually uncover a bug in this implementation. The paths generated were not quite right.

@ericholscher ericholscher merged commit b997f31 into master Aug 8, 2019
@ericholscher ericholscher deleted the davidfischer/search-indexing-with-storage branch August 8, 2019 17:35
@ericholscher
Copy link
Member

Great, I'll get this shipped today 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants